Course Name: CSC-592-U17-Top: Internet of Things¶
Project Name: A Machine Learning Approach to Sleep Quality Classification using Logistic Regression, Support Vector Machine, Artificial Neural Network, and Random Forest on IoT Dataset¶
Contributor Names: Abdullah Al Rakin & Mohammad Navid Nayyem¶

Import all necessary libraries¶

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score, train_test_split, cross_val_predict
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff

Load the dataset¶

In [2]:
#df = pd.read_csv('C:/Users/moham/OneDrive/Desktop/Sleep_Dataset_Merged.csv')
df = pd.read_csv('/Users/abdullahalrakin/Desktop/IoT/Project/Dataset/Sleep_Dataset_Merged.csv')

Preview the dataset¶

In [3]:
df
Out[3]:
ID Age Gender Sleep duration Sleep efficiency REM_sleep_ratio Deep_sleep_ratio Light_sleep_ratio Sleep_debt Sleep_latency Total_weekly_steps steps_duration_in_hr
0 1 65 Female 6.0 0.88 3.00 11.67 2.00 2.00 6.00 31375 1632
1 2 69 Male 7.0 0.66 2.71 4.00 7.57 1.00 7.00 43210 2668
2 3 40 Female 8.0 0.89 2.50 8.75 1.25 0.00 16.00 60815 1195
3 4 40 Female 6.0 0.51 3.83 4.17 8.67 2.00 6.00 62936 2187
4 5 57 Male 8.0 0.76 3.38 6.88 2.25 0.00 8.00 35529 1448
... ... ... ... ... ... ... ... ... ... ... ... ...
821 822 59 Female 8.1 0.90 2.93 4.62 7.63 1.89 18.97 49000 2042
822 823 59 Female 8.0 0.90 3.21 12.45 1.35 1.66 9.90 49000 2042
823 824 59 Female 8.1 0.90 3.85 6.24 2.06 -0.70 17.07 49000 2042
824 825 59 Female 8.1 0.90 4.27 11.09 6.58 0.65 11.66 49000 2042
825 826 59 Female 8.1 0.90 4.82 6.18 5.24 1.89 7.16 49000 2042

826 rows × 12 columns

Define weights for each column¶

In [4]:
weights = {
    'Sleep duration': 0.1,
    'Sleep efficiency': 0.1,
    'Age' : 0.1,
    'REM_sleep_ratio': 0.1,
    'Deep_sleep_ratio': 0.1,
    'Light_sleep_ratio': 0.1,
    'Sleep_debt': 0.1,
    'Sleep_latency': 0.1,
    'Total_weekly_steps': 0.1,
    'steps_duration_in_hr': 0.1
}
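Because every weight is 0.1, the weighted sum below reduces to one tenth of the plain row sum, so features with large raw magnitudes (e.g. Total_weekly_steps) dominate the score before rescaling. A minimal check with a toy row (values illustrative):

```python
# With equal weights of 0.1, the weighted score is just one tenth of the row sum.
weights = {'Sleep duration': 0.1, 'Sleep efficiency': 0.1, 'Total_weekly_steps': 0.1}
row = {'Sleep duration': 6.0, 'Sleep efficiency': 0.88, 'Total_weekly_steps': 31375}

weighted = sum(row[k] * weights[k] for k in row)
assert abs(weighted - 0.1 * sum(row.values())) < 1e-6
```

Standardizing the features before weighting would give each one comparable influence; here the subsequent min-max rescaling only shifts the combined score's range, not the relative contributions.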

Sleep quality score calculation¶

In [5]:
numerical_cols = ['Age','Sleep duration', 'Sleep efficiency', 'REM_sleep_ratio', 'Deep_sleep_ratio', 
                  'Light_sleep_ratio', 'Sleep_debt', 'Sleep_latency', 'Total_weekly_steps', 
                  'steps_duration_in_hr']

# Vectorized weighted sum (equivalent to a row-wise apply, but faster)
df['sleep_quality_score'] = (df[numerical_cols] * pd.Series(weights)).sum(axis=1)

Sleep quality score rescaling¶

In [6]:
scaler = MinMaxScaler(feature_range=(1, 5))
df['sleep_quality_score'] = scaler.fit_transform(df[['sleep_quality_score']].values).round(2)
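MinMaxScaler with feature_range=(1, 5) applies x' = 1 + 4·(x − min)/(max − min); a small sketch reproducing the transform by hand on toy values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[2.0], [4.0], [10.0]])  # toy raw scores

scaler = MinMaxScaler(feature_range=(1, 5))
scaled = scaler.fit_transform(x)

# Same transform written out: 1 + 4 * (x - min) / (max - min)
manual = 1 + 4 * (x - x.min()) / (x.max() - x.min())
assert np.allclose(scaled, manual)  # the minimum maps to 1, the maximum to 5
```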

Categorizing sleep quality scores¶

In [7]:
def map_to_category(score):
    # Scores are rescaled to [1, 5]; bins are checked in ascending order
    if score <= 1.5:
        return "Very Poor"
    elif score <= 2.5:
        return "Poor"
    elif score <= 3.5:
        return "Average"
    elif score <= 4.5:
        return "Good"
    else:
        return "Excellent"

df['sleep_quality_category'] = df['sleep_quality_score'].apply(map_to_category)

Display the "sleep_quality_score" & "sleep_quality_category"¶

In [8]:
df[["sleep_quality_score", "sleep_quality_category"]]
Out[8]:
sleep_quality_score sleep_quality_category
0 1.87 Poor
1 2.88 Average
2 4.14 Good
3 4.39 Good
4 2.18 Poor
... ... ...
821 3.29 Average
822 3.29 Average
823 3.29 Average
824 3.29 Average
825 3.29 Average

826 rows × 2 columns

Visualization 01: Pie chart of Gender distribution¶

In [9]:
gender_counts = df['Gender'].value_counts().reset_index()
gender_counts.columns = ['Gender', 'Count']  # stable column names across pandas versions

fig = px.pie(gender_counts, values='Count', names='Gender',
             title='Gender Distribution',
             color_discrete_sequence=['lightcoral', 'lightskyblue'])
fig.update_traces(hoverinfo='label+percent', textinfo='percent')

fig.show()

Visualization 02: Donut Chart of Average Sleep duration by Gender¶

In [10]:
avg_sleep_duration = df.groupby('Gender')['Sleep duration'].mean().reset_index()

fig = px.pie(avg_sleep_duration, values='Sleep duration', names='Gender', hole=0.5,
             title='Average Sleep Duration by Gender', color='Gender')
fig.update_traces(textinfo='percent+label', textposition='inside', showlegend=True)

fig.show()

Visualization 03: Histogram of Distribution of Age¶

In [11]:
histogram_age = px.histogram(df, x='Age', nbins=20, title='Distribution of Age')

histogram_age.update_traces(histnorm='probability density',
                            marker=dict(color='skyblue', line=dict(color='black', width=1)))
histogram_age.update_layout(xaxis_title='Age', yaxis_title='Density')

histogram_age.show()

Visualization 04: Histogram of Sleep duration¶

In [12]:
fig = px.histogram(df, x='Sleep duration', title='Histogram of Sleep duration')

fig.update_layout(
    xaxis=dict(title='Sleep duration', showline=True, linewidth=1, linecolor='black'),
    yaxis=dict(title='Frequency', showline=True, linewidth=1, linecolor='black'),
    bargap=0.01,
    showlegend=False,
    hovermode='closest',
    hoverlabel=dict(bgcolor="white", font_size=12),
)

fig.show()

Visualization 05: Scatter plot of Age vs. Sleep duration colored by Gender¶

In [13]:
scatter_plot = px.scatter(df, x='Age', y='Sleep duration', color='Gender',
                          title='Age vs. Sleep duration colored by Gender')
scatter_plot.update_traces(marker=dict(size=12, opacity=0.8))

scatter_plot.show()

Visualization 06: Histogram of sleep quality scores¶

In [14]:
fig = px.histogram(df, x='sleep_quality_score', nbins=20, marginal='rug', opacity=0.7, color_discrete_sequence=['#636EFA'])
fig.update_layout(title='Distribution of Sleep Quality Scores', xaxis_title='Sleep Quality Score', yaxis_title='Frequency', bargap=0.1)

fig.show()

Visualization 07: Bar chart of sleep quality categories¶

In [15]:
category_counts = df['sleep_quality_category'].value_counts().reset_index()
category_counts.columns = ['sleep_quality_category', 'Count']  # stable column names across pandas versions

fig = px.bar(category_counts, x='sleep_quality_category', y='Count', color='sleep_quality_category')
fig.update_layout(title='Distribution of Sleep Quality Categories', xaxis_title='Sleep Quality Category', yaxis_title='Count')

fig.show()

Visualization 08: Boxplot of sleep quality scores by category¶

In [16]:
fig = px.box(df, x='sleep_quality_category', y='sleep_quality_score', points='all', color='sleep_quality_category')
fig.update_layout(title='Sleep Quality Score Distribution by Category', xaxis_title='Sleep Quality Category', yaxis_title='Sleep Quality Score')

fig.show()

Visualization 09: Distribution of Sleep Efficiency using histogram¶

In [17]:
fig = px.histogram(df, x='Sleep efficiency', nbins=20, marginal='rug', color_discrete_sequence=['#6495ED'])
fig.update_layout(title='Distribution of Sleep Efficiency',
                  xaxis_title='Sleep Efficiency',
                  yaxis_title='Frequency',
                  bargap=0.02)
fig.show()

Visualization 10: Scatter plot of sleep duration vs. sleep efficiency¶

In [18]:
fig = px.scatter(df, x='Sleep duration', y='Sleep efficiency', color='sleep_quality_category', hover_data=['ID'])
fig.update_layout(title='Sleep Duration vs. Sleep Efficiency', xaxis_title='Sleep Duration', yaxis_title='Sleep Efficiency')

fig.show()

Visualization 11: Violin plot of Sleep Efficiency by Sleep Quality Category¶

In [19]:
fig = px.violin(df, x='sleep_quality_category', y='Sleep efficiency', box=True, points="all", color='sleep_quality_category', title='Distribution of Sleep Efficiency by Sleep Quality Category')
fig.update_layout(xaxis_title='Sleep Quality Category', yaxis_title='Sleep Efficiency')

fig.show()

Visualization 12: Correlation Heatmap of numerical features & interactive heatmap¶

In [20]:
plt.figure(figsize=(10, 8))
sns.heatmap(df[['Age', 'Sleep duration', 'Sleep efficiency', 'REM_sleep_ratio', 'Deep_sleep_ratio', 'Light_sleep_ratio', 'Sleep_debt', 'Sleep_latency', 'Total_weekly_steps', 'steps_duration_in_hr', 'sleep_quality_score']].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

corr_matrix = df[['Age', 'Sleep duration', 'Sleep efficiency', 'REM_sleep_ratio', 'Deep_sleep_ratio', 'Light_sleep_ratio', 'Sleep_debt', 'Sleep_latency', 'Total_weekly_steps', 'steps_duration_in_hr', 'sleep_quality_score']].corr()

# Interactive heatmap
fig = ff.create_annotated_heatmap(z=corr_matrix.values,
                                   x=list(corr_matrix.columns),
                                   y=list(corr_matrix.index),
                                   colorscale='RdBu',
                                   annotation_text=corr_matrix.round(2).values,
                                   showscale=True)

fig.update_layout(title='Correlation Heatmap of Numerical Features',
                  xaxis_title='Features',
                  yaxis_title='Features')

fig.show()

Visualization 13: Interactive cross-tabulation of categorical features¶

In [21]:
categorical_features = ['Gender', 'sleep_quality_category']

cross_tab = pd.crosstab(df[categorical_features[0]], df[categorical_features[1]])

fig = ff.create_annotated_heatmap(z=cross_tab.values,
                                   x=list(cross_tab.columns),
                                   y=list(cross_tab.index),
                                   colorscale='YlGnBu',
                                   annotation_text=cross_tab.values,
                                   showscale=True)

fig.update_layout(title='Cross-tabulation Heatmap of Categorical Features',
                  xaxis_title=categorical_features[1],
                  yaxis_title=categorical_features[0])

fig.show()

Separate features & target variable¶

In [22]:
# Note: sleep_quality_score stays in X even though the target was derived from it,
# which largely explains the very high accuracies reported below
X = df.drop(columns=['ID', 'Gender', 'sleep_quality_category'])
y = df['sleep_quality_category']

Create a pipeline with StandardScaler & Logistic Regression¶

In [23]:
# Define the pipeline for Logistic Regression
lg_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Define the pipeline for Support Vector Machine (SVM)
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC())
])

# Define the pipeline for Artificial Neural Network (ANN)
ann_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', MLPClassifier())
])

# Define the pipeline for Random Forest Classifier
rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])

Data Standardization & Logistic Regression Training¶

In [24]:
# Suppress ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Standardize the data
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[['sleep_quality_score']])

# Fit a baseline logistic regression on the scaled score alone (this model is not reused below)
logistic_model = LogisticRegression(max_iter=1000)
logistic_model.fit(scaled_features, y)
Out[24]:
LogisticRegression(max_iter=1000)

Define parameter grids for grid search¶

In [25]:
# Define parameter grid for Logistic Regression
lg_param_grid = {
    'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver' : ['liblinear']
}

# Define parameter grid for Support Vector Machine (SVM)
svm_param_grid = {
    'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'classifier__kernel': ['linear', 'rbf', 'poly'],
    'classifier__gamma': ['scale', 'auto']
}

# Define parameter grid for Artificial Neural Network (ANN)
ann_param_grid = {
    'classifier__hidden_layer_sizes': [(100,), (50,50), (100,50,25)],
    'classifier__activation': ['logistic', 'relu'],
    'classifier__solver': ['adam'],
    'classifier__alpha': [0.0001, 0.001, 0.01],
    'classifier__learning_rate': ['constant','adaptive'],
}

# Define parameter grid for Random Forest Classifier
rf_param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    'classifier__bootstrap': [True, False]
}

Cross-Validation with StratifiedKFold¶

In [26]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
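StratifiedKFold preserves each class's proportion within every fold, which matters here because the five sleep quality categories are imbalanced. A small sketch with a toy 80/20 label showing each test fold keeping the minority share:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X_toy = np.zeros((100, 1))
y_toy = np.array([0] * 80 + [1] * 20)  # 80/20 class imbalance

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for _, test_idx in skf.split(X_toy, y_toy):
    fold_pos = (y_toy[test_idx] == 1).mean()
    assert abs(fold_pos - 0.2) < 1e-9  # each fold keeps the 20% minority share
```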

Grid search with cross-validation¶

In [27]:
# Perform grid search for Logistic Regression
lg_grid_search = GridSearchCV(lg_pipeline, lg_param_grid, cv=skf, scoring='accuracy')
lg_grid_search.fit(X, y)

# Perform grid search for Support Vector Machine (SVM)
svm_grid_search = GridSearchCV(svm_pipeline, svm_param_grid, cv=skf, scoring='accuracy')
svm_grid_search.fit(X, y)

# Perform grid search for Artificial Neural Network (ANN)
ann_grid_search = GridSearchCV(ann_pipeline, ann_param_grid, cv=skf, scoring='accuracy')
ann_grid_search.fit(X, y)

# Perform grid search for Random Forest Classifier
rf_grid_search = GridSearchCV(rf_pipeline, rf_param_grid, cv=skf, scoring='accuracy')
rf_grid_search.fit(X, y)
Out[27]:
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=42, shuffle=True),
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('classifier',
                                        RandomForestClassifier())]),
             param_grid={'classifier__bootstrap': [True, False],
                         'classifier__max_depth': [None, 10, 20],
                         'classifier__min_samples_leaf': [1, 2, 4],
                         'classifier__min_samples_split': [2, 5, 10],
                         'classifier__n_estimators': [50, 100, 200]},
             scoring='accuracy')

Print the best parameters & corresponding accuracy¶

In [28]:
# Print results for Logistic Regression
print("\nLogistic Regression (LR) Results:")
print("Best Parameters:", lg_grid_search.best_params_)
print("Best Accuracy:", lg_grid_search.best_score_)

# Print results for Support Vector Machine (SVM)
print("\nSupport Vector Machine (SVM) Results:")
print("Best Parameters (SVM):", svm_grid_search.best_params_)
print("Best Accuracy (SVM):", svm_grid_search.best_score_)

# Print results for Artificial Neural Network (ANN)
print("\nArtificial Neural Network (ANN) Results:")
print("Best Parameters (ANN):", ann_grid_search.best_params_)
print("Best Accuracy (ANN):", ann_grid_search.best_score_)

# Print results for Random Forest Classifier
print("\nRandom Forest Classifier Results:")
print("Best Parameters (Random Forest):", rf_grid_search.best_params_)
print("Best Accuracy (Random Forest):", rf_grid_search.best_score_)
Logistic Regression (LR) Results:
Best Parameters: {'classifier__C': 100, 'classifier__penalty': 'l1', 'classifier__solver': 'liblinear'}
Best Accuracy: 0.8244395764877692

Support Vector Machine (SVM) Results:
Best Parameters (SVM): {'classifier__C': 10, 'classifier__gamma': 'scale', 'classifier__kernel': 'linear'}
Best Accuracy (SVM): 0.9794158451989776

Artificial Neural Network (ANN) Results:
Best Parameters (ANN): {'classifier__activation': 'logistic', 'classifier__alpha': 0.0001, 'classifier__hidden_layer_sizes': (50, 50), 'classifier__learning_rate': 'adaptive', 'classifier__solver': 'adam'}
Best Accuracy (ANN): 0.9612413289521722

Random Forest Classifier Results:
Best Parameters (Random Forest): {'classifier__bootstrap': False, 'classifier__max_depth': 10, 'classifier__min_samples_leaf': 1, 'classifier__min_samples_split': 5, 'classifier__n_estimators': 50}
Best Accuracy (Random Forest): 0.9963636363636365

Display Cross-Validation scores¶

In [29]:
# Cross-validation for Logistic Regression
lg_cv_scores = cross_val_score(lg_grid_search.best_estimator_, X, y, cv=skf)
print("Cross-Validation Scores (LR):", np.array2string(lg_cv_scores, separator=', '))

# Cross-validation for Support Vector Machine (SVM)
svm_cv_scores = cross_val_score(svm_grid_search.best_estimator_, X, y, cv=skf)
print("Cross-Validation Scores (SVM):", np.array2string(svm_cv_scores, separator=', '))

# Cross-validation for Artificial Neural Network (ANN)
ann_cv_scores = cross_val_score(ann_grid_search.best_estimator_, X, y, cv=skf)
print("Cross-Validation Scores (ANN):", np.array2string(ann_cv_scores, separator=', '))

# Cross-validation for Random Forest Classifier
rf_cv_scores = cross_val_score(rf_grid_search.best_estimator_, X, y, cv=skf)
print("Cross-Validation Scores (Random Forest Classifier):", np.array2string(rf_cv_scores, separator=', '))
Cross-Validation Scores (LR): [0.8373494 , 0.81212121, 0.81212121, 0.83030303, 0.84242424]
Cross-Validation Scores (SVM): [0.98192771, 0.96969697, 0.96969697, 0.99393939, 0.98181818]
Cross-Validation Scores (ANN): [0.96385542, 0.98181818, 0.96363636, 0.95151515, 0.92727273]
Cross-Validation Scores (Random Forest Classifier): [1.        , 0.98181818, 0.98787879, 1.        , 1.        ]

Make predictions using cross-validation with the best estimator¶

In [30]:
# Predict using cross-validation for Logistic Regression
lg_predictions = cross_val_predict(lg_grid_search.best_estimator_, X, y, cv=skf)

# Predict using cross-validation for Support Vector Machine (SVM)
svm_predictions = cross_val_predict(svm_grid_search.best_estimator_, X, y, cv=skf)

# Predict using cross-validation for Artificial Neural Network (ANN)
ann_predictions = cross_val_predict(ann_grid_search.best_estimator_, X, y, cv=skf)

# Predict using cross-validation for Random Forest Classifier
rf_predictions = cross_val_predict(rf_grid_search.best_estimator_, X, y, cv=skf)

Display Classification Report & Confusion Matrix¶

In [31]:
# Print classification report for Logistic Regression
print("Classification Report (Logistic Regression):")
print(classification_report(y, lg_predictions))

# Generate confusion matrix for Logistic Regression
lg_conf_matrix = confusion_matrix(y, lg_predictions)
print("Confusion Matrix (Logistic Regression):")
print(lg_conf_matrix)
print("\n")

# Print classification report for Support Vector Machine (SVM)
print("Classification Report (SVM):")
print(classification_report(y, svm_predictions))

# Generate confusion matrix for Support Vector Machine (SVM)
svm_conf_matrix = confusion_matrix(y, svm_predictions)
print("Confusion Matrix (SVM):")
print(svm_conf_matrix)
print("\n")

# Print classification report for Artificial Neural Network (ANN)
print("Classification Report (ANN):")
print(classification_report(y, ann_predictions))

# Generate confusion matrix for Artificial Neural Network (ANN)
ann_conf_matrix = confusion_matrix(y, ann_predictions)
print("Confusion Matrix (ANN):")
print(ann_conf_matrix)
print("\n")

# Print classification report for Random Forest Classifier
print("Classification Report (Random Forest Classifier):")
print(classification_report(y, rf_predictions))

# Generate confusion matrix for Random Forest Classifier
rf_conf_matrix = confusion_matrix(y, rf_predictions)
print("Confusion Matrix (Random Forest Classifier):")
print(rf_conf_matrix)
print("\n")
Classification Report (Logistic Regression):
              precision    recall  f1-score   support

     Average       0.74      0.77      0.76       274
   Excellent       0.99      0.96      0.97        74
        Good       0.78      0.83      0.80       216
        Poor       0.91      0.80      0.85       198
   Very Poor       0.98      0.97      0.98        64

    accuracy                           0.83       826
   macro avg       0.88      0.87      0.87       826
weighted avg       0.83      0.83      0.83       826

Confusion Matrix (Logistic Regression):
[[212   0  49  13   0]
 [  0  71   3   0   0]
 [ 35   1 180   0   0]
 [ 39   0   0 158   1]
 [  0   0   0   2  62]]


Classification Report (SVM):
              precision    recall  f1-score   support

     Average       0.99      0.98      0.98       274
   Excellent       0.97      0.99      0.98        74
        Good       0.98      0.99      0.98       216
        Poor       0.97      0.98      0.97       198
   Very Poor       0.98      0.94      0.96        64

    accuracy                           0.98       826
   macro avg       0.98      0.97      0.98       826
weighted avg       0.98      0.98      0.98       826

Confusion Matrix (SVM):
[[268   0   3   3   0]
 [  0  73   1   0   0]
 [  1   2 213   0   0]
 [  2   0   0 195   1]
 [  0   0   0   4  60]]


Classification Report (ANN):
              precision    recall  f1-score   support

     Average       0.97      0.99      0.98       274
   Excellent       0.95      0.93      0.94        74
        Good       0.97      0.97      0.97       216
        Poor       0.94      0.97      0.96       198
   Very Poor       0.98      0.81      0.89        64

    accuracy                           0.96       826
   macro avg       0.96      0.94      0.95       826
weighted avg       0.96      0.96      0.96       826

Confusion Matrix (ANN):
[[271   0   2   1   0]
 [  0  69   5   0   0]
 [  3   4 209   0   0]
 [  4   0   0 193   1]
 [  0   0   0  12  52]]


Classification Report (Random Forest Classifier):
              precision    recall  f1-score   support

     Average       0.99      1.00      0.99       274
   Excellent       1.00      0.99      0.99        74
        Good       1.00      1.00      1.00       216
        Poor       0.99      0.99      0.99       198
   Very Poor       0.98      0.98      0.98        64

    accuracy                           0.99       826
   macro avg       0.99      0.99      0.99       826
weighted avg       0.99      0.99      0.99       826

Confusion Matrix (Random Forest Classifier):
[[273   0   0   1   0]
 [  0  73   1   0   0]
 [  1   0 215   0   0]
 [  1   0   0 196   1]
 [  0   0   0   1  63]]


Visualization 14: Visualize the confusion matrix with "Predicted" & "Actual"¶

In [32]:
# Plot confusion matrix for Logistic Regression
labels = lg_grid_search.best_estimator_.classes_
plt.figure(figsize=(12, 6))
sns.heatmap(lg_conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Logistic Regression)')
plt.show()

# Plot confusion matrix for Support Vector Machine (SVM)
svm_labels = svm_grid_search.best_estimator_.classes_
plt.figure(figsize=(12, 6))
sns.heatmap(svm_conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=svm_labels, yticklabels=svm_labels)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Support Vector Machine)')
plt.show()

# Plot confusion matrix for Artificial Neural Network (ANN)
ann_labels = ann_grid_search.best_estimator_.classes_
plt.figure(figsize=(12, 6))
sns.heatmap(ann_conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=ann_labels, yticklabels=ann_labels)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Artificial Neural Network)')
plt.show()

# Plot confusion matrix for Random Forest Classifier
rf_labels = rf_grid_search.best_estimator_.classes_
plt.figure(figsize=(12, 6))
sns.heatmap(rf_conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=rf_labels, yticklabels=rf_labels)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Random Forest Classifier)')
plt.show()

Project Overview¶

In this project, we developed and compared four machine learning models (Logistic Regression, Support Vector Machine, Artificial Neural Network, and Random Forest) on sleep quality data. The aim was to leverage machine learning techniques to analyze various sleep-related metrics and categorize sleep quality, providing personalized insights.

The following key steps and visualizations are implemented:

Dataset Preparation¶

  • Two separate datasets were utilized for this project.
  • The datasets were merged into one comprehensive CSV file.
  • Relevant feature engineering techniques were applied to enhance the dataset's informativeness.
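The merge itself is not shown in the notebook, which loads an already-merged CSV. A minimal sketch of how the two sources might be combined with pandas (the column names and the join key below are hypothetical, not taken from the actual datasets):

```python
import pandas as pd

# Hypothetical fragments standing in for the two source datasets
sleep_eff = pd.DataFrame({'ID': [1, 2], 'Sleep efficiency': [0.88, 0.66]})
lifestyle = pd.DataFrame({'ID': [1, 2], 'Sleep duration': [6.0, 7.0]})

# Join on the shared key, keeping only records present in both sources
merged = sleep_eff.merge(lifestyle, on='ID', how='inner')
merged.to_csv('Sleep_Dataset_Merged.csv', index=False)
```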

Dataset Links:¶

  • Dataset 1 : Sleep Efficiency Dataset
  • Dataset 2 : Sleep Health and Lifestyle Dataset

Data Preprocessing¶

  • Load the sleep dataset from the CSV file.
  • Define weights for each feature to calculate sleep quality scores.
  • Rescale and categorize sleep quality scores.

Data Exploration¶

The dataset, containing sleep-related features such as age, gender, sleep duration, and sleep efficiency, is loaded and explored.

Visual Insights:¶

  • Gender distribution pie chart
  • Average sleep duration by gender pie chart
  • Age distribution histogram
  • Histogram of sleep duration
  • Scatter plot of age vs. sleep duration colored by gender
  • Distribution of sleep quality scores histogram
  • Distribution of sleep quality categories bar chart
  • Sleep quality score distribution by category box plot
  • Sleep efficiency distribution histogram
  • Scatter plot of sleep duration vs. sleep efficiency colored by sleep quality category
  • Violin plot of sleep efficiency distribution by sleep quality category
  • Correlation heatmap of numerical features
  • Cross-tabulation heatmap of categorical features

Feature Engineering¶

Enhancing Feature Representation:¶

  • Calculate sleep quality scores based on defined weighted numerical features.
  • Rescale sleep quality scores to a standardized range.
  • Categorize sleep quality scores into five categories: Very Poor, Poor, Average, Good, and Excellent.

Model Development¶

Logistic Regression Modeling:¶

  • Logistic Regression model is trained and optimized using grid search and cross-validation.

Support Vector Machine (SVM) Modeling:¶

  • SVM model is trained and optimized using grid search and cross-validation.

Artificial Neural Network (ANN) Modeling:¶

  • ANN model is trained and optimized using grid search and cross-validation.

Random Forest Classifier Modeling:¶

  • Random Forest Classifier model is trained and optimized using grid search and cross-validation.

Model Evaluation¶

Performance Assessment:¶

  • Evaluate model accuracy, precision, recall, and F1-score through classification reports.
  • Assess model performance using confusion matrices.
  • Analyze cross-validation scores to ensure model robustness.

Confusion Matrix Visualization:¶

  • A detailed confusion matrix is visualized for each model's performance assessment.

Conclusion¶

Innovation and Collaboration:¶

  • Abdullah Al Rakin and Mohammad Navid Nayyem collaborated on refining the IoT-integrated sleep quality monitoring system, integrating various machine learning models for accurate sleep quality categorization.

Future Prospects:¶

  • The project lays the groundwork for future advancements in IoT-based healthcare applications and personalized sleep monitoring systems, enabling individuals to better understand and improve their sleep quality.

Note: Contributions are distributed equally between Abdullah Al Rakin and Mohammad Navid Nayyem.